Part I
Lesson 1
Learning about the mechanics of wrangling data, particularly using the mutate() and summarise() functions in an efficient manner. The datasets used are penguins and ames.
The dimensions of the penguins dataset: 344 observations and 8 columns, or variables.
Code
dim(penguins)
[1] 344   8
To get a good overview of a dataset, we can use the skim() function from the skimr package. It provides summary statistics and reports the proportion of missing cases.
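The call that generated the overview below was folded away in the rendered notes; it is simply:

```r
library(skimr)
library(palmerpenguins)

# One-line overview: dimensions, column types, missingness, and mini-histograms
skim(penguins)
```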
| Name | palmerpenguins::penguins |
| Number of rows | 344 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| factor | 3 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| species | 0 | 1.00 | FALSE | 3 | Ade: 152, Gen: 124, Chi: 68 |
| island | 0 | 1.00 | FALSE | 3 | Bis: 168, Dre: 124, Tor: 52 |
| sex | 11 | 0.97 | FALSE | 2 | mal: 168, fem: 165 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| bill_length_mm | 2 | 0.99 | 43.92 | 5.46 | 32.1 | 39.23 | 44.45 | 48.5 | 59.6 | ▃▇▇▆▁ |
| bill_depth_mm | 2 | 0.99 | 17.15 | 1.97 | 13.1 | 15.60 | 17.30 | 18.7 | 21.5 | ▅▅▇▇▂ |
| flipper_length_mm | 2 | 0.99 | 200.92 | 14.06 | 172.0 | 190.00 | 197.00 | 213.0 | 231.0 | ▂▇▃▅▂ |
| body_mass_g | 2 | 0.99 | 4201.75 | 801.95 | 2700.0 | 3550.00 | 4050.00 | 4750.0 | 6300.0 | ▃▇▆▃▂ |
| year | 0 | 1.00 | 2008.03 | 0.82 | 2007.0 | 2007.00 | 2008.00 | 2009.0 | 2009.0 | ▇▁▇▁▇ |
An alternative to skimr for descriptive statistics is the gt_plt_summary() function from the gtExtras package. For it to work, you need to select the variables in advance.
Code
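A minimal sketch of the folded call, assuming gtExtras and its gt dependency are installed:

```r
library(dplyr)
library(gtExtras)
library(palmerpenguins)

# Select the variables in advance, then summarise them in a gt table
penguins |>
  select(bill_length_mm, bill_depth_mm) |>
  gt_plt_summary()
```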
| select(penguins, bill_length_mm, bill_depth_mm) | ||||||
|---|---|---|---|---|---|---|
| 344 rows x 2 cols | ||||||
| Column | Plot Overview | Missing | Mean | Median | SD | |
| bill_length_mm | 0.6% | 43.9 | 44.5 | 5.5 | ||
| bill_depth_mm | 0.6% | 17.2 | 17.3 | 2.0 | ||
If you want to focus on particular variables, you can pass them into the skim() function. For instance, examining the properties of Lot_Frontage and Lot_Area in the ames data:
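The folded call likely looked like this (note the capitalized names in the raw ames data):

```r
library(skimr)

# Skim only the two lot variables from the ames housing data
modeldata::ames |>
  skim(Lot_Frontage, Lot_Area)
```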
| Name | modeldata::ames |
| Number of rows | 2930 |
| Number of columns | 74 |
| _______________________ | |
| Column type frequency: | |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Lot_Frontage | 0 | 1 | 57.65 | 33.50 | 0 | 43.00 | 63.0 | 78.00 | 313 | ▇▇▁▁▁ |
| Lot_Area | 0 | 1 | 10147.92 | 7880.02 | 1300 | 7440.25 | 9436.5 | 11555.25 | 215245 | ▇▁▁▁▁ |
Notice the ames data does not have nice variable names. It is a good idea to use janitor::clean_names() to convert the capitalized names to lower case.
Code
modeldata::ames |>
janitor::clean_names() |>
skimr::skim(lot_frontage, lot_area)
| Name | janitor::clean_names(mode… |
| Number of rows | 2930 |
| Number of columns | 74 |
| _______________________ | |
| Column type frequency: | |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| lot_frontage | 0 | 1 | 57.65 | 33.50 | 0 | 43.00 | 63.0 | 78.00 | 313 | ▇▇▁▁▁ |
| lot_area | 0 | 1 | 10147.92 | 7880.02 | 1300 | 7440.25 | 9436.5 | 11555.25 | 215245 | ▇▁▁▁▁ |
Lesson 2
The focus is the mutate() function. The lesson also deals with vectorized versus iterative calculations within mutate(), and introduces the map() function. map() applies a function to each element of a vector, computing the results one by one instead of "vector smashing". The variations of map() discussed include map_lgl() (map logical) and map_dbl() (map double). The first produces a logical vector; the second produces a numeric vector, here 1 (TRUE) and 0 (FALSE).
Handling missing cases
It is important to handle missing values before calling mutate(). We can verify the missing cases were removed using skimr.
Code
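The step that created penguins_wo_NAs is not shown in these notes; a likely version, assuming tidyr's drop_na(), is:

```r
library(tidyr)
library(palmerpenguins)

# Drop the 11 incomplete rows, leaving 333 complete cases
penguins_wo_NAs <- penguins |>
  drop_na()
```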
| Name | penguins_wo_NAs |
| Number of rows | 333 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| factor | 3 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| species | 0 | 1 | FALSE | 3 | Ade: 146, Gen: 119, Chi: 68 |
| island | 0 | 1 | FALSE | 3 | Bis: 163, Dre: 123, Tor: 47 |
| sex | 0 | 1 | FALSE | 2 | mal: 168, fem: 165 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| bill_length_mm | 0 | 1 | 43.99 | 5.47 | 32.1 | 39.5 | 44.5 | 48.6 | 59.6 | ▃▇▇▆▁ |
| bill_depth_mm | 0 | 1 | 17.16 | 1.97 | 13.1 | 15.6 | 17.3 | 18.7 | 21.5 | ▅▆▇▇▂ |
| flipper_length_mm | 0 | 1 | 200.97 | 14.02 | 172.0 | 190.0 | 197.0 | 213.0 | 231.0 | ▂▇▃▅▃ |
| body_mass_g | 0 | 1 | 4207.06 | 805.22 | 2700.0 | 3550.0 | 4050.0 | 4775.0 | 6300.0 | ▃▇▅▃▂ |
| year | 0 | 1 | 2008.04 | 0.81 | 2007.0 | 2007.0 | 2008.0 | 2009.0 | 2009.0 | ▇▁▇▁▇ |
Handling vectors
mutate() is a vectorised function. This means that it takes the two columns referenced inside mutate() as whole vectors. In other words, the values of each column are not taken one by one: they are handled as vectors and combined via a vectorized division (/).
Imagine these two vectors are “smashed together” and the result is divided.
Code
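The folded code was not captured; here is a reconstruction that reproduces the values shown (despite its name, bill_flipper_ratio matches bill_length_mm divided by bill_depth_mm, e.g. 39.1 / 18.7 ≈ 2.09):

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

penguins_wo_NAs <- drop_na(penguins)  # recreate the dataset used in these notes

# The two columns are divided as whole vectors, element by element
penguins_wo_NAs |>
  mutate(bill_flipper_ratio = bill_length_mm / bill_depth_mm) |>
  select(bill_length_mm, bill_depth_mm, bill_flipper_ratio)
```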
# A tibble: 333 × 3
bill_length_mm bill_depth_mm bill_flipper_ratio
<dbl> <dbl> <dbl>
1 39.1 18.7 2.09
2 39.5 17.4 2.27
3 40.3 18 2.24
4 36.7 19.3 1.90
5 39.3 20.6 1.91
6 38.9 17.8 2.19
7 39.2 19.6 2
8 41.1 17.6 2.34
9 38.6 21.2 1.82
10 34.6 21.1 1.64
# ℹ 323 more rows
An example where the vectorized approach breaks down, using a custom function to illustrate the problem. Next, let's apply the function to one of the vectors used in estimating bill_flipper_ratio, in this case bill_length_mm.
Notice that applying the function prompts an error message, traced back to if (x > 39) ...: the result shows the condition has length > 1.
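The definition of the custom function was folded away; here is a guess consistent with the error message and with the TRUE/FALSE pattern in the later output (the threshold 39 is inferred):

```r
# A scalar function: `if` expects a single TRUE/FALSE, not a whole vector
large_quantity <- function(x) {
  if (x > 39) TRUE else FALSE
}

large_quantity(40.3)  # fine on a single value
# mutate(penguins_wo_NAs, large = large_quantity(bill_length_mm))
# Error in if (x > 39) ... : the condition has length > 1
```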
Functional programming or map()
map_lgl() enforces one-by-one calculation instead of whole-vector calculation.
To address this limitation of working with vectors, we need functional programming. map() performs the calculation one element at a time. The operation is carried out by the large_quantity() function within the mutate() call.
In other words, instead of using the vector-smashing approach on the observations contained in the column, each element is handled individually.
We rely on the map() family, in this case the logical variant, map_lgl(). We instruct R to apply the large_quantity() function to each component of the column (vector) bill_length_mm, one by one. Notice that map_lgl() lists the vector, or column, first, and then the function to be applied to it: it applies the function to each element of the vector instead of smashing the whole vector.
“The map functions transform their input by applying a function to each element of a list or atomic vector and returning an object of the same length as the input.”
Code
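A plausible reconstruction of the folded code; large_quantity() is redefined here (an assumption, as above) so the block is self-contained:

```r
library(dplyr)
library(purrr)
library(tidyr)
library(palmerpenguins)

penguins_wo_NAs <- drop_na(penguins)
large_quantity <- function(x) if (x > 39) TRUE else FALSE

# map_lgl() feeds bill_length_mm to large_quantity() one element at a time
penguins_wo_NAs |>
  mutate(large_bill_length = map_lgl(bill_length_mm, large_quantity)) |>
  select(bill_length_mm, large_bill_length)
```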
# A tibble: 333 × 2
bill_length_mm large_bill_length
<dbl> <lgl>
1 39.1 TRUE
2 39.5 TRUE
3 40.3 TRUE
4 36.7 FALSE
5 39.3 TRUE
6 38.9 FALSE
7 39.2 TRUE
8 41.1 TRUE
9 38.6 FALSE
10 34.6 FALSE
# ℹ 323 more rows
Another variation is map() itself. Instead of producing individual values, map() produces a list: each entry of the new column is a one-element list rather than an atomic value.
Code
# A tibble: 333 × 2
bill_length_mm large_bill_length
<dbl> <list>
1 39.1 <lgl [1]>
2 39.5 <lgl [1]>
3 40.3 <lgl [1]>
4 36.7 <lgl [1]>
5 39.3 <lgl [1]>
6 38.9 <lgl [1]>
7 39.2 <lgl [1]>
8 41.1 <lgl [1]>
9 38.6 <lgl [1]>
10 34.6 <lgl [1]>
# ℹ 323 more rows
Still another variation of the map() family is map_dbl(). It reports numbers instead of logical values. Again, we list first the column to be mapped over, followed by the function (large_quantity).
Code
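The same pattern with map_dbl(); the single TRUE/FALSE returned by large_quantity() (redefined here as an assumption) is coerced to 1/0:

```r
library(dplyr)
library(purrr)
library(tidyr)
library(palmerpenguins)

penguins_wo_NAs <- drop_na(penguins)
large_quantity <- function(x) if (x > 39) TRUE else FALSE

# map_dbl() returns a double vector: TRUE becomes 1, FALSE becomes 0
penguins_wo_NAs |>
  mutate(large_bill_length = map_dbl(bill_length_mm, large_quantity)) |>
  select(bill_length_mm, large_bill_length)
```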
# A tibble: 333 × 2
bill_length_mm large_bill_length
<dbl> <dbl>
1 39.1 1
2 39.5 1
3 40.3 1
4 36.7 0
5 39.3 1
6 38.9 0
7 39.2 1
8 41.1 1
9 38.6 0
10 34.6 0
# ℹ 323 more rows
Lesson 3
This section covers the summarise() function, the counterpart of the mutate() function. It also works with vectors, but the difference is the amount of output: while mutate() can be used to produce whole vectors, summarise() turns out single elements, or atomic values.
Code
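The folded summarise() call can be reconstructed from the column names in the output:

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

penguins_wo_NAs <- drop_na(penguins)  # as used throughout these notes

# summarise() collapses each whole column into a single atomic value
penguins_wo_NAs |>
  summarise(
    mean_flipper_length = mean(flipper_length_mm),
    sd_flipper_length   = sd(flipper_length_mm)
  )
```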
# A tibble: 1 × 2
mean_flipper_length sd_flipper_length
<dbl> <dbl>
1 201. 14.0
You can also save the lengths into one single list-column using the list() function. In other words, the last column, labeled flipper_lengths, contains a vector of values: 333 values corresponding to the 333 penguins' flipper lengths.
Code
# A tibble: 1 × 3
mean_flipper_length sd_flipper_length flipper_lengths
<dbl> <dbl> <list>
1 201. 14.0 <int [333]>
Lesson 4
Handling missing values in common calculations: if not handled, the summarise() function generates NAs. One option is to use na.rm = TRUE.
Code
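On the full penguins data (344 rows, with NAs), the likely call adds na.rm = TRUE to each summary:

```r
library(dplyr)
library(palmerpenguins)

# Without na.rm = TRUE, mean() and sd() would both return NA here
penguins |>
  summarise(
    mean_flipper_length = mean(flipper_length_mm, na.rm = TRUE),
    sd_flipper_length   = sd(flipper_length_mm, na.rm = TRUE),
    flipper_lengths     = list(flipper_length_mm)
  )
```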
# A tibble: 1 × 3
mean_flipper_length sd_flipper_length flipper_lengths
<dbl> <dbl> <list>
1 201. 14.1 <int [344]>
Lesson 5
Repeated calculations for subsets of data. There are two options to automate repeated calculations: a for() loop, or grouped calculations. The example relies on the data without missing values.
One option is the argument .by = to specify subgroups. Be careful about omitting the "."; the code would still produce an output but would miss summarizing the results by species.
Code
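A plausible reconstruction of the grouped call, with .by = species inside summarise():

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

penguins_wo_NAs <- drop_na(penguins)

# .by = species repeats the whole summary once per species
penguins_wo_NAs |>
  summarise(
    mean_flipper_length = mean(flipper_length_mm),
    sd_flipper_length   = sd(flipper_length_mm),
    flipper_lengths     = list(flipper_length_mm),
    .by = species
  )
```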
# A tibble: 3 × 4
species mean_flipper_length sd_flipper_length flipper_lengths
<fct> <dbl> <dbl> <list>
1 Adelie 190. 6.52 <int [146]>
2 Gentoo 217. 6.59 <int [119]>
3 Chinstrap 196. 7.13 <int [68]>
A warning regarding grouping in a chain of piped computations, as shown below: group_by() changes the underlying structure of the data, leaving the result grouped by the first variable listed in group_by().
Notice the output indicates # Groups: species [3]: 3 species across the 5 species-island combinations. Adding another summarise() call creates a problem. Suppose we want to estimate the mean of the mean_flipper_length estimated in the previous run, as well as the sd of the sd_flipper_length variable previously created. The end result is an sd_flipper_length with two NAs (Chinstrap and Gentoo each live on a single island, and the sd of a single value is NA).
During the first calculation, summarise() strips the last group listed in group_by(): the variable island is dropped from the grouping. The end result is a 5 × 5 tibble organized into three groups corresponding to the 3 species. This invites mistakes if one ignores the structure imposed on the data by the group_by() function: grouped calculations behave differently.
Code
# A tibble: 3 × 3
species mean_flipper_length sd_flipper_length
<fct> <dbl> <dbl>
1 Adelie 190. 0.255
2 Chinstrap 196. NA
3 Gentoo 217. NA
To remove the underlying grouping structure, add the option .groups = "drop" in the summarise() call. This option returns an ungrouped 5 × 5 tibble of the species-island combinations.
Code
# A tibble: 5 × 5
species island mean_flipper_length sd_flipper_length flipper_lengths
<fct> <fct> <dbl> <dbl> <list>
1 Adelie Biscoe 189. 6.73 <int [44]>
2 Adelie Dream 190. 6.48 <int [55]>
3 Adelie Torgersen 192. 6.22 <int [47]>
4 Chinstrap Dream 196. 7.13 <int [68]>
5 Gentoo Biscoe 217. 6.59 <int [119]>
Another alternative to .groups = "drop" is the argument .by = c(species, island), which also returns an ungrouped result. Notice that in this case the .by = argument replaces the group_by() call.
Code
# A tibble: 5 × 5
species island mean_flipper_length sd_flipper_length flipper_lengths
<fct> <fct> <dbl> <dbl> <list>
1 Adelie Torgersen 192. 6.22 <int [47]>
2 Adelie Biscoe 189. 6.73 <int [44]>
3 Adelie Dream 190. 6.48 <int [55]>
4 Gentoo Biscoe 217. 6.59 <int [119]>
5 Chinstrap Dream 196. 7.13 <int [68]>
Lesson 6
Unique function
Using the grouping argument .by = with mutate() can produce a long list of means, in this case a long list of means by species repeated on every row. Notice the mean column appears to report one single value on screen. To see the rest of the means across species, you need to create a column, say mean_column, and print it.
Code
# A tibble: 344 × 3
bill_length_mm flipper_length_mm mean_column
<dbl> <int> <dbl>
1 39.1 181 38.8
2 39.5 186 38.8
3 40.3 195 38.8
4 NA NA 38.8
5 36.7 193 38.8
6 39.3 190 38.8
7 38.9 181 38.8
8 39.2 195 38.8
9 34.1 193 38.8
10 42 190 38.8
# ℹ 334 more rows
In order to make sense of the listing, let's just select the species column and the mean column.
Code
# A tibble: 344 × 2
species mean_column
<fct> <dbl>
1 Adelie 38.8
2 Adelie 38.8
3 Adelie 38.8
4 Adelie 38.8
5 Adelie 38.8
6 Adelie 38.8
7 Adelie 38.8
8 Adelie 38.8
9 Adelie 38.8
10 Adelie 38.8
# ℹ 334 more rows
In order to avoid a long list of 344 means by species, use the unique() function. The unique() function eliminates duplicates: there were many duplicates of the mean within each species, and unique() collapses them, reporting one row per species.
The .by = species argument forces the calculations to take place separately for each level of the referenced column, in this case the 3 species of penguins. unique(), on the other hand, removes the duplicated rows (means) within each species.
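Putting the two together, a sketch of the idea (na.rm = TRUE is assumed here, since this lesson used the full penguins data):

```r
library(dplyr)
library(palmerpenguins)

# One mean per species, repeated on every row; unique() collapses the repeats
penguins |>
  mutate(mean_column = mean(bill_length_mm, na.rm = TRUE), .by = species) |>
  select(species, mean_column) |>
  unique()
```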
Centering a variable by its mean
Centering a variable is an ideal situation for using the .by = argument in a mutate() operation instead of summarise(): a situation to use the scale() function instead of reporting the means for each group as we did with mutate(). "scale is a generic R function whose default method centers and/or scales the columns of a numeric matrix."
Notice we are centering, or scaling, each penguin species' bill length by its own mean, and assuring transparency in the plot by requesting alpha = 0.5.
Centering shifts each distribution to a common zero point. It is a popular technique in machine learning. It helps locate the three species' distributions on a common axis, allowing comparisons across species, or group scaling.
Code
Without centering, you can see that the three distributions have different locations. The lengths vary roughly between 30 and 60, so the center of each species' distribution is different.
Code
palmerpenguins::penguins |>
# mutate(
# bill_length_mm = scale(bill_length_mm),
# .by = species) |>
ggplot() +
geom_density(aes(x = bill_length_mm, fill = species),
alpha = 0.5)
Another option is to center across all penguins instead of within each species; that produces a graph without a common center for the groups. This is why, if you want to make group comparisons, you need to use the .by option, which produces a common center, or zero point.
Lesson 7
Missing data with group calculations
Notice not all combinations of species and island are displayed. For instance, the Chinstrap-Biscoe, Chinstrap-Torgersen, Gentoo-Dream, and Gentoo-Torgersen combinations are not displayed.
Code
# A tibble: 5 × 5
species island mean_flipper_length sd_flipper_length flipper_lengths
<fct> <fct> <dbl> <dbl> <list>
1 Adelie Torgersen 191. 6.23 <int [52]>
2 Adelie Biscoe 189. 6.73 <int [44]>
3 Adelie Dream 190. 6.59 <int [56]>
4 Gentoo Biscoe 217. 6.48 <int [124]>
5 Chinstrap Dream 196. 7.13 <int [68]>
To display all combinations of species and island, even those combinations with no observations, we can use the complete() function.
complete() displays all combinations among variables even when some have no data. It also allows changing the default NA value to another, say 0, using the fill argument.
Code
penguins_wo_NAs |>
summarise(
mean_flipper_length = mean(flipper_length_mm, na.rm = TRUE),
sd_flipper_length = sd(flipper_length_mm, na.rm = TRUE),
flipper_lengths = list(flipper_length_mm),
.by = c(species,island) ) |>
complete(
species, island,
fill = list(mean_flipper_length = 0,
sd_flipper_length = 0)
)
# A tibble: 9 × 5
species island mean_flipper_length sd_flipper_length flipper_lengths
<fct> <fct> <dbl> <dbl> <list>
1 Adelie Biscoe 189. 6.73 <int [44]>
2 Adelie Dream 190. 6.48 <int [55]>
3 Adelie Torgersen 192. 6.22 <int [47]>
4 Chinstrap Biscoe 0 0 <NULL>
5 Chinstrap Dream 196. 7.13 <int [68]>
6 Chinstrap Torgersen 0 0 <NULL>
7 Gentoo Biscoe 217. 6.59 <int [119]>
8 Gentoo Dream 0 0 <NULL>
9 Gentoo Torgersen 0 0 <NULL>
Lesson 8
Using across() + summarise() instead of repetitions.
Original summarise()
Code
# A tibble: 1 × 4
mean_bill_depth mean_bill_length mean_flipper_length mean_body_mass
<dbl> <dbl> <dbl> <dbl>
1 17.2 44.0 201. 4207.
Using the across() function. Several variations follow.
We also use glue syntax to affix a name to the corresponding output. This is done via the argument .names = 'mean_{.col}'. This glue specification affixes the prefix mean_ to all the columns/variables referenced in .cols = c(...).
Without it, the output does not display the prefix mean_; if you want it, you need the .names argument, which is a glue specification.
Code
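A plausible reconstruction of the across() call that produces the prefixed names below:

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

penguins_wo_NAs <- drop_na(penguins)

# One across() call replaces four near-identical mean() lines;
# .names = 'mean_{.col}' is a glue spec that prefixes each output column
penguins_wo_NAs |>
  summarise(
    across(
      .cols  = c(bill_length_mm, bill_depth_mm,
                 flipper_length_mm, body_mass_g),
      .fns   = mean,
      .names = 'mean_{.col}'
    )
  )
```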
# A tibble: 1 × 4
mean_bill_length_mm mean_bill_depth_mm mean_flipper_length_mm mean_body_mass_g
<dbl> <dbl> <dbl> <dbl>
1 44.0 17.2 201. 4207.
Code
# A tibble: 1 × 4
mean_bill_length_mm mean_bill_depth_mm mean_flipper_length_mm mean_body_mass_g
<dbl> <dbl> <dbl> <dbl>
1 44.0 17.2 201. 4207.
Code
# A tibble: 1 × 5
mean_bill_length_mm mean_bill_depth_mm mean_flipper_length_mm mean_body_mass_g
<dbl> <dbl> <dbl> <dbl>
1 44.0 17.2 201. 4207.
# ℹ 1 more variable: mean_year <dbl>
Code
# A tibble: 1 × 2
mean_bill_length_mm mean_bill_depth_mm
<dbl> <dbl>
1 44.0 17.2
Code
#Selecting columns with specific names
penguins_wo_NAs |>
summarise(across(starts_with('flipper'),
function(x) mean(x, na.rm = TRUE),
.names = 'mean_{.col}'))
# A tibble: 1 × 1
mean_flipper_length_mm
<dbl>
1 201.
Removing unwanted strings
Note that we could remove unwanted suffixes such as _mm and _g from the resulting names.
Code
# A tibble: 1 × 4
mean_bill_length mean_bill_depth mean_flipper_length mean_body_mass
<dbl> <dbl> <dbl> <dbl>
1 44.0 17.2 201. 4207.
Another option is to include {.fn} in the .names specification. Unfortunately, with a bare (unnamed) function this does not do the trick: the prefix mean_ is not produced. Instead, a prefix 1_ is added, because the function has no name to insert.
Code
# A tibble: 1 × 4
`1_bill_length` `1_bill_depth` `1_flipper_length` `1_body_mass`
<dbl> <dbl> <dbl> <dbl>
1 44.0 17.2 201. 4207.
To make it work correctly, we need to name the function in a list, for instance .fns = list(avg = mean): the name given in the list (here avg) becomes the prefix in the output. This is the best option for displaying the names of the summary statistics called for.
Code
# A tibble: 1 × 4
avg_bill_length avg_bill_depth avg_flipper_length avg_body_mass
<dbl> <dbl> <dbl> <dbl>
1 44.0 17.2 201. 4207.
You can compute more summary statistics by also adding the estimation of the standard deviation.
Code
# A tibble: 1 × 8
avg_bill_length standard_deviation_bil…¹ avg_bill_depth standard_deviation_b…²
<dbl> <dbl> <dbl> <dbl>
1 44.0 5.47 17.2 1.97
# ℹ abbreviated names: ¹standard_deviation_bill_length,
# ²standard_deviation_bill_depth
# ℹ 4 more variables: avg_flipper_length <dbl>,
# standard_deviation_flipper_length <dbl>, avg_body_mass <dbl>,
# standard_deviation_body_mass <dbl>
Lesson 9
Using across() + mutate( ) instead of repetitions
Using across() inside mutate() facilitates cleaning multiple columns in one single shot. In this example, we use across() along with parse_number() to extract the first number inside each text string.
We may get a warning, but the numbers are parsed, or extracted, from the strings. The end result is a tibble with numbers instead of characters.
parse_number() extracts the first number from any string that has numbers embedded in it.
Code
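A self-contained sketch; the toy tibble is reconstructed from the output shown below:

```r
library(dplyr)
library(readr)

# Toy data matching the strings shown in these notes
texts <- tibble(
  text1 = c("I have 10 apples", "I have 20 appples"),
  text2 = c("I have 30 apples", "I have 40 apples")
)

# across(everything(), ...) parses every column in one shot
texts |>
  mutate(across(everything(), parse_number))
```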
# A tibble: 2 × 2
text1 text2
<dbl> <dbl>
1 10 30
2 20 40
If you want to keep both the numbers and the original text, you need to add new columns, as follows:
Code
# A tibble: 2 × 4
text1 text2 number_text1 number_text2
<chr> <chr> <dbl> <dbl>
1 I have 10 apples I have 30 apples 10 30
2 I have 20 appples I have 40 apples 20 40
Lesson 10
Adding calculations next to across().
Recapping a prior example below: the operation listed inside summarise() is just one of several calculations that could be added. Below, str_remove() removes the string given in its parentheses from the column names, across all columns.
Code
# A tibble: 1 × 4
mean_bill_length mean_bill_depth mean_flipper_length mean_body_mass_g
<dbl> <dbl> <dbl> <dbl>
1 44.0 17.2 201. 4207.
Code
# A tibble: 1 × 4
Avg_bill_length Avg_bill_depth Avg_flipper_length Avg_body_mass_g
<dbl> <dbl> <dbl> <dbl>
1 44.0 17.2 201. 4207.
Adding more operations to summarise(): say you want to add a standard deviation, sd_bill_length. All you need to do is add a line.
Code
# A tibble: 1 × 5
mean_bill_length mean_bill_depth mean_flipper_length mean_body_mass_g
<dbl> <dbl> <dbl> <dbl>
1 44.0 17.2 201. 4207.
# ℹ 1 more variable: sd_bill_length <dbl>
Code
# A tibble: 1 × 5
mean_bill_length mean_bill_depth mean_flipper_length mean_body_mass_g
<dbl> <dbl> <dbl> <dbl>
1 44.0 17.2 201. 4207.
# ℹ 1 more variable: sd_bill_length <dbl>
Lesson 11
Grouped calculations with across()
In the last lesson we learned that not only can we use across() for multiple computations, but we can also combine it with other analyses. This means we can do across() calculations on a grouped basis as well. In short, we can perform powerful estimations in a concise manner.
In the example below, we can add the estimation of means and standard deviations by species and island.
Code
# A tibble: 5 × 7
species island mean_bill_length mean_bill_depth mean_flipper_length
<fct> <fct> <dbl> <dbl> <dbl>
1 Adelie Torgersen 39.0 18.5 192.
2 Adelie Biscoe 39.0 18.4 189.
3 Adelie Dream 38.5 18.2 190.
4 Gentoo Biscoe 47.6 15.0 217.
5 Chinstrap Dream 48.8 18.4 196.
# ℹ 2 more variables: mean_body_mass_g <dbl>, sd_bill_length <dbl>
Code
# A tibble: 5 × 7
species island mean_bill_length mean_bill_depth mean_flipper_length
<fct> <fct> <dbl> <dbl> <dbl>
1 Adelie Torgersen 39.0 18.5 192.
2 Adelie Biscoe 39.0 18.4 189.
3 Adelie Dream 38.5 18.2 190.
4 Gentoo Biscoe 47.6 15.0 217.
5 Chinstrap Dream 48.8 18.4 196.
# ℹ 2 more variables: mean_body_mass_g <dbl>, sd_bill_length <dbl>
An extension of this idea also applies to .names =. See below.
Code
# A tibble: 5 × 11
species island avg_bill_length standard_deviation_bill_l…¹ avg_bill_depth
<fct> <fct> <dbl> <dbl> <dbl>
1 Adelie Torgersen 39.0 3.03 18.5
2 Adelie Biscoe 39.0 2.48 18.4
3 Adelie Dream 38.5 2.48 18.2
4 Gentoo Biscoe 47.6 3.11 15.0
5 Chinstrap Dream 48.8 3.34 18.4
# ℹ abbreviated name: ¹standard_deviation_bill_length
# ℹ 6 more variables: standard_deviation_bill_depth <dbl>,
# avg_flipper_length <dbl>, standard_deviation_flipper_length <dbl>,
# avg_body_mass <dbl>, standard_deviation_body_mass <dbl>,
# sd_bill_lenght <dbl>
Lesson 12
Getting to know the reframe() function
reframe() is a function sitting between mutate() and summarise(). Its output has arbitrary length: from a single row, as with summarise(), to multiple rows, as with mutate().
The range() function is relevant here. It returns the minimum and maximum values of a column.
If we use range() within summarise(), we get an error message asking us to use reframe() instead.
Code
range(penguins_wo_NAs$bill_length_mm)
[1] 32.1 59.6
Code
# A tibble: 2 × 1
range_bill_lenght_mm
<dbl>
1 32.1
2 59.6
Since reframe() along with range() returns unnamed rows, we can add a type column (for instance type = c("min", "max")) to label the rows in the output.
Code
# A tibble: 2 × 2
type range_bill_lenght_mm
<chr> <dbl>
1 min 32.1
2 max 59.6
We can do more things with reframe(): we can compute grouped statistics as well.
Code
# A tibble: 6 × 3
species type range_bill_lenght_mm
<fct> <chr> <dbl>
1 Adelie min 32.1
2 Adelie max 46
3 Gentoo min 40.9
4 Gentoo max 59.6
5 Chinstrap min 40.9
6 Chinstrap max 58
We can also add across() and request range() within .fns =. This approach inserts the across() call within the reframe() call.
Code
# A tibble: 6 × 6
species type bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <chr> <dbl> <dbl> <int> <int>
1 Adelie min 32.1 15.5 172 2850
2 Adelie max 46 21.5 210 4775
3 Gentoo min 40.9 13.1 203 3950
4 Gentoo max 59.6 17.3 231 6300
5 Chinstrap min 40.9 16.4 178 2700
6 Chinstrap max 58 20.8 212 4800
But we could also format the result as a table. To do so, we call the pivot_longer() function and then the pivot_wider() function.
Code
#Pivot longer
penguins_wo_NAs |>
reframe(
type = c("min", "max"),
across(.cols = c("bill_length_mm",
"bill_depth_mm",
"flipper_length_mm",
"body_mass_g"),
.fns = range
),
.by = species) |>
pivot_longer(
cols = -c(1:2),
names_to = "quantity",
values_to = 'value'
) |>
pivot_wider(
id_cols = c(quantity),
names_from = c(species,type),
values_from = value
)
# A tibble: 4 × 7
quantity Adelie_min Adelie_max Gentoo_min Gentoo_max Chinstrap_min
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 bill_length_mm 32.1 46 40.9 59.6 40.9
2 bill_depth_mm 15.5 21.5 13.1 17.3 16.4
3 flipper_length_mm 172 210 203 231 178
4 body_mass_g 2850 4775 3950 6300 2700
# ℹ 1 more variable: Chinstrap_max <dbl>
Lesson 13
Motivation for tidyselect helpers: they are powerful techniques that let one select the right columns or variables. For instance, .fns = lets one label the computed variables, and those labels are then collected via the .names = argument (see below):
Code
# A tibble: 5 × 10
species island avg_bill_length standard_deviation_bill_l…¹ avg_bill_depth
<fct> <fct> <dbl> <dbl> <dbl>
1 Adelie Torgersen 39.0 3.03 18.5
2 Adelie Biscoe 39.0 2.48 18.4
3 Adelie Dream 38.5 2.48 18.2
4 Gentoo Biscoe 47.6 3.11 15.0
5 Chinstrap Dream 48.8 3.34 18.4
# ℹ abbreviated name: ¹standard_deviation_bill_length
# ℹ 5 more variables: standard_deviation_bill_depth <dbl>,
# avg_flipper_length <dbl>, standard_deviation_flipper_length <dbl>,
# avg_body_mass_g <dbl>, standard_deviation_body_mass_g <dbl>
Instead of listing the columns where the means and standard deviations are to be computed, we can use tidyselect helpers to locate them. The helper used here is where(). It works in combination with another function, is.numeric(), which tests for the desired property of the column or variable.
Code
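A plausible reconstruction; here the suffixes (_mm, _g) are kept in the names, whereas the notes also stripped them via the .names spec:

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

penguins_wo_NAs <- drop_na(penguins)

# where(is.numeric) picks every numeric column; no need to list them by name
penguins_wo_NAs |>
  summarise(
    across(where(is.numeric), mean, .names = 'avg_{.col}'),
    .by = c(species, island)
  )
```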
# A tibble: 5 × 7
species island avg_bill_length avg_bill_depth avg_flipper_length
<fct> <fct> <dbl> <dbl> <dbl>
1 Adelie Torgersen 39.0 18.5 192.
2 Adelie Biscoe 39.0 18.4 189.
3 Adelie Dream 38.5 18.2 190.
4 Gentoo Biscoe 47.6 15.0 217.
5 Chinstrap Dream 48.8 18.4 196.
# ℹ 2 more variables: avg_body_mass_g <dbl>, avg_year <dbl>
Notice this omnibus statement includes the column year, which is numeric but is a label rather than a measurement. To exclude this kind of variable from the omnibus statement, add a minus sign. Notice we need to wrap the selection in the concatenate function c(): the expression c(where(is.numeric), -year) combines the tidyselect helper with the exclusion of the column year.
Code
# A tibble: 5 × 6
species island avg_bill_length avg_bill_depth avg_flipper_length
<fct> <fct> <dbl> <dbl> <dbl>
1 Adelie Torgersen 39.0 18.5 192.
2 Adelie Biscoe 39.0 18.4 189.
3 Adelie Dream 38.5 18.2 190.
4 Gentoo Biscoe 47.6 15.0 217.
5 Chinstrap Dream 48.8 18.4 196.
# ℹ 1 more variable: avg_body_mass_g <dbl>
Lesson 14
Another custom function for across(). Using the original penguins dataset (which still contains NAs), the summary calculation returns missing values.
How do we avoid ending up with NAs? We can use the mean() function with the na.rm option. The way to do so is to specify a new function that sets na.rm = TRUE; we can do this via an anonymous function(x).
Two options:
Code
# A tibble: 5 × 6
species island mean_bill_length mean_bill_depth mean_flipper_length
<fct> <fct> <dbl> <dbl> <dbl>
1 Adelie Torgersen 39.0 18.4 191.
2 Adelie Biscoe 39.0 18.4 189.
3 Adelie Dream 38.5 18.3 190.
4 Gentoo Biscoe 47.5 15.0 217.
5 Chinstrap Dream 48.8 18.4 196.
# ℹ 1 more variable: mean_body_mass_g <dbl>
Code
# A tibble: 5 × 6
species island mean_bill_length mean_bill_depth mean_flipper_length
<fct> <fct> <dbl> <dbl> <dbl>
1 Adelie Torgersen 39.0 18.4 191.
2 Adelie Biscoe 39.0 18.4 189.
3 Adelie Dream 38.5 18.3 190.
4 Gentoo Biscoe 47.5 15.0 217.
5 Chinstrap Dream 48.8 18.4 196.
# ℹ 1 more variable: mean_body_mass_g <dbl>
Code
# A tibble: 5 × 6
species island avg_bill_length avg_bill_depth avg_flipper_length
<fct> <fct> <dbl> <dbl> <dbl>
1 Adelie Torgersen 39.0 18.4 191.
2 Adelie Biscoe 39.0 18.4 189.
3 Adelie Dream 38.5 18.3 190.
4 Gentoo Biscoe 47.5 15.0 217.
5 Chinstrap Dream 48.8 18.4 196.
# ℹ 1 more variable: avg_body_mass_g <dbl>
Lesson 15
Selecting the right data by column/variable type. Using select() along with where(is.numeric) screens the dataset for variables that are numeric. We can just as well select the variables that are not numeric.
The where() function can be used in several scenarios.
# A tibble: 333 × 4
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<dbl> <dbl> <int> <int>
1 39.1 18.7 181 3750
2 39.5 17.4 186 3800
3 40.3 18 195 3250
4 36.7 19.3 193 3450
5 39.3 20.6 190 3650
6 38.9 17.8 181 3625
7 39.2 19.6 195 4675
8 41.1 17.6 182 3200
9 38.6 21.2 191 3800
10 34.6 21.1 198 4400
# ℹ 323 more rows
Code
# A tibble: 333 × 3
species island sex
<fct> <fct> <fct>
1 Adelie Torgersen male
2 Adelie Torgersen female
3 Adelie Torgersen female
4 Adelie Torgersen female
5 Adelie Torgersen male
6 Adelie Torgersen female
7 Adelie Torgersen male
8 Adelie Torgersen female
9 Adelie Torgersen male
10 Adelie Torgersen male
# ℹ 323 more rows
Code
# A tibble: 333 × 3
species island sex
<fct> <fct> <fct>
1 Adelie Torgersen male
2 Adelie Torgersen female
3 Adelie Torgersen female
4 Adelie Torgersen female
5 Adelie Torgersen male
6 Adelie Torgersen female
7 Adelie Torgersen male
8 Adelie Torgersen female
9 Adelie Torgersen male
10 Adelie Torgersen male
# ℹ 323 more rows
Alternatively, you can define a custom function. Inside where() you can use an anonymous function, combining two tests with an OR (||). The resulting function is TRUE if the corresponding column/variable is a factor or a character; if it is numeric, it yields FALSE.
Code
penguins_wo_NAs |>
select(where(function(x) is.factor(x) || is.character(x)))
# A tibble: 333 × 3
species island sex
<fct> <fct> <fct>
1 Adelie Torgersen male
2 Adelie Torgersen female
3 Adelie Torgersen female
4 Adelie Torgersen female
5 Adelie Torgersen male
6 Adelie Torgersen female
7 Adelie Torgersen male
8 Adelie Torgersen female
9 Adelie Torgersen male
10 Adelie Torgersen male
# ℹ 323 more rows
Defining a custom function to identify numeric columns, we can add another condition, such as requiring the mean to be above a particular value.
So here we define an anonymous function that checks whether a variable is numeric and whether the mean of the variable is above 200.
The difference between the double && and the single & operator: && combines two single logical values and short-circuits, so mean() is never evaluated for non-numeric columns; the single & evaluates both sides element-wise, so calling mean() on a factor generates the warning that the argument is not numeric or logical.
Code
penguins_wo_NAs |>
select(where(function(x) is.numeric(x) && mean(x) > 200))
# A tibble: 333 × 3
flipper_length_mm body_mass_g year
<int> <int> <int>
1 181 3750 2007
2 186 3800 2007
3 195 3250 2007
4 193 3450 2007
5 190 3650 2007
6 181 3625 2007
7 195 4675 2007
8 182 3200 2007
9 191 3800 2007
10 198 4400 2007
# ℹ 323 more rows
Another handy option is to select columns where a specific percentage of values is missing, in this case asking for variables with more than 1% missing cases.
It is usually practical to name this function for future use.
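A sketch of such a named helper (the name mostly_missing is hypothetical):

```r
library(dplyr)
library(palmerpenguins)

# TRUE when more than 1% of a column's values are missing
mostly_missing <- function(x) mean(is.na(x)) > 0.01

# Only sex qualifies: 11 of 344 values (about 3.2%) are NA
penguins |>
  select(where(mostly_missing))
```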
# A tibble: 344 × 1
sex
<fct>
1 male
2 female
3 female
4 <NA>
5 female
6 male
7 female
8 male
9 <NA>
10 <NA>
# ℹ 334 more rows
Code
# A tibble: 344 × 1
sex
<fct>
1 male
2 female
3 female
4 <NA>
5 female
6 male
7 female
8 male
9 <NA>
10 <NA>
# ℹ 334 more rows
Lesson 16
Select the right data by column name
Selecting consecutive columns: between the bill_length_mm and sex columns.
Selecting columns by the end of their names: variables ending with _mm.
Selecting columns that start with a particular string: variables starting with bill.
Selecting columns that contain a string: variables containing length.
Code
#Selecting consecutive columns
penguins_wo_NAs |>
select(bill_length_mm:sex)
# A tibble: 333 × 5
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
<dbl> <dbl> <int> <int> <fct>
1 39.1 18.7 181 3750 male
2 39.5 17.4 186 3800 female
3 40.3 18 195 3250 female
4 36.7 19.3 193 3450 female
5 39.3 20.6 190 3650 male
6 38.9 17.8 181 3625 female
7 39.2 19.6 195 4675 male
8 41.1 17.6 182 3200 female
9 38.6 21.2 191 3800 male
10 34.6 21.1 198 4400 male
# ℹ 323 more rows
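The chunk behind the next output is hidden; presumably it selects the variables that end with _mm:

```r
#Selecting columns ending with _mm
penguins_wo_NAs |>
  select(ends_with('_mm'))
```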
# A tibble: 333 × 3
bill_length_mm bill_depth_mm flipper_length_mm
<dbl> <dbl> <int>
1 39.1 18.7 181
2 39.5 17.4 186
3 40.3 18 195
4 36.7 19.3 193
5 39.3 20.6 190
6 38.9 17.8 181
7 39.2 19.6 195
8 41.1 17.6 182
9 38.6 21.2 191
10 34.6 21.1 198
# ℹ 323 more rows
Code
#Selecting columns starting with bill
penguins_wo_NAs |>
select(starts_with('bill'))
# A tibble: 333 × 2
bill_length_mm bill_depth_mm
<dbl> <dbl>
1 39.1 18.7
2 39.5 17.4
3 40.3 18
4 36.7 19.3
5 39.3 20.6
6 38.9 17.8
7 39.2 19.6
8 41.1 17.6
9 38.6 21.2
10 34.6 21.1
# ℹ 323 more rows
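The chunk for the next output is hidden as well; presumably it selects the variables whose names contain the string length:

```r
#Selecting columns containing the string length
penguins_wo_NAs |>
  select(contains('length'))
```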
# A tibble: 333 × 2
bill_length_mm flipper_length_mm
<dbl> <int>
1 39.1 181
2 39.5 186
3 40.3 195
4 36.7 193
5 39.3 190
6 38.9 181
7 39.2 195
8 41.1 182
9 38.6 191
10 34.6 198
# ℹ 323 more rows
Using regular expressions extends the range and scope of the search for variables. For instance, we can search for variables that start with the letter f or b, are followed by any number of characters (.*), and end with the string _mm — the pattern '^(f|b).*(_mm)$'.
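A likely version of the hidden chunk, using matches() with the regular expression described above:

```r
#Selecting columns with a regular expression
penguins_wo_NAs |>
  select(matches('^(f|b).*(_mm)$'))
```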
# A tibble: 333 × 3
bill_length_mm bill_depth_mm flipper_length_mm
<dbl> <dbl> <int>
1 39.1 18.7 181
2 39.5 17.4 186
3 40.3 18 195
4 36.7 19.3 193
5 39.3 20.6 190
6 38.9 17.8 181
7 39.2 19.6 195
8 41.1 17.6 182
9 38.6 21.2 191
10 34.6 21.1 198
# ℹ 323 more rows
There are two more ways to select columns/variables based on a character vector that contains their names: the any_of() and all_of() functions. all_of() throws an error if any name in the vector is missing from the data, while any_of() silently skips it.
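The chunk behind the next output is hidden; judging from the result (only bill_depth_mm is returned), it presumably used any_of() with a vector containing a misspelled name, which any_of() quietly ignores:

```r
# 'bill_lenght_mm' is deliberately misspelled; any_of() skips it
my_cols <- c("bill_lenght_mm", "bill_depth_mm")
penguins_wo_NAs |>
  select(any_of(my_cols))
```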
Code
# A tibble: 333 × 1
bill_depth_mm
<dbl>
1 18.7
2 17.4
3 18
4 19.3
5 20.6
6 17.8
7 19.6
8 17.6
9 21.2
10 21.1
# ℹ 323 more rows
Code
# palmerpenguins::penguins |>
# select(all_of(my_cols))
# # all_of
# my_cols <- c("bill_lenght_mm", "bill_depth_mm")
# penguins_wo_NAs |>
# select(all_of(my_cols))
Lesson 17
Bringing it all together.
Using the ames housing dataset from the modeldata package.
Code
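The chunk is hidden; judging from the table header ("clean_names(ames)"), it likely cleans the column names with janitor::clean_names() before skimming:

```r
library(modeldata)
library(janitor)
library(skimr)

ames |>
  clean_names() |>
  skim()
```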
| Name | clean_names(ames) |
| Number of rows | 2930 |
| Number of columns | 74 |
| _______________________ | |
| Column type frequency: | |
| factor | 40 |
| numeric | 34 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| ms_sub_class | 0 | 1 | FALSE | 16 | One: 1079, Two: 575, One: 287, One: 192 |
| ms_zoning | 0 | 1 | FALSE | 7 | Res: 2273, Res: 462, Flo: 139, Res: 27 |
| street | 0 | 1 | FALSE | 2 | Pav: 2918, Grv: 12 |
| alley | 0 | 1 | FALSE | 3 | No_: 2732, Gra: 120, Pav: 78 |
| lot_shape | 0 | 1 | FALSE | 4 | Reg: 1859, Sli: 979, Mod: 76, Irr: 16 |
| land_contour | 0 | 1 | FALSE | 4 | Lvl: 2633, HLS: 120, Bnk: 117, Low: 60 |
| utilities | 0 | 1 | FALSE | 3 | All: 2927, NoS: 2, NoS: 1 |
| lot_config | 0 | 1 | FALSE | 5 | Ins: 2140, Cor: 511, Cul: 180, FR2: 85 |
| land_slope | 0 | 1 | FALSE | 3 | Gtl: 2789, Mod: 125, Sev: 16 |
| neighborhood | 0 | 1 | FALSE | 28 | Nor: 443, Col: 267, Old: 239, Edw: 194 |
| condition_1 | 0 | 1 | FALSE | 9 | Nor: 2522, Fee: 164, Art: 92, RRA: 50 |
| condition_2 | 0 | 1 | FALSE | 8 | Nor: 2900, Fee: 13, Art: 5, Pos: 4 |
| bldg_type | 0 | 1 | FALSE | 5 | One: 2425, Twn: 233, Dup: 109, Twn: 101 |
| house_style | 0 | 1 | FALSE | 8 | One: 1481, Two: 873, One: 314, SLv: 128 |
| overall_cond | 0 | 1 | FALSE | 9 | Ave: 1654, Abo: 533, Goo: 390, Ver: 144 |
| roof_style | 0 | 1 | FALSE | 6 | Gab: 2321, Hip: 551, Gam: 22, Fla: 20 |
| roof_matl | 0 | 1 | FALSE | 8 | Com: 2887, Tar: 23, WdS: 9, WdS: 7 |
| exterior_1st | 0 | 1 | FALSE | 16 | Vin: 1026, Met: 450, HdB: 442, Wd : 420 |
| exterior_2nd | 0 | 1 | FALSE | 17 | Vin: 1015, Met: 447, HdB: 406, Wd : 397 |
| mas_vnr_type | 0 | 1 | FALSE | 5 | Non: 1775, Brk: 880, Sto: 249, Brk: 25 |
| exter_cond | 0 | 1 | FALSE | 5 | Typ: 2549, Goo: 299, Fai: 67, Exc: 12 |
| foundation | 0 | 1 | FALSE | 6 | PCo: 1310, CBl: 1244, Brk: 311, Sla: 49 |
| bsmt_cond | 0 | 1 | FALSE | 6 | Typ: 2616, Goo: 122, Fai: 104, No_: 80 |
| bsmt_exposure | 0 | 1 | FALSE | 5 | No: 1906, Av: 418, Gd: 284, Mn: 239 |
| bsmt_fin_type_1 | 0 | 1 | FALSE | 7 | GLQ: 859, Unf: 851, ALQ: 429, Rec: 288 |
| bsmt_fin_type_2 | 0 | 1 | FALSE | 7 | Unf: 2499, Rec: 106, LwQ: 89, No_: 81 |
| heating | 0 | 1 | FALSE | 6 | Gas: 2885, Gas: 27, Gra: 9, Wal: 6 |
| heating_qc | 0 | 1 | FALSE | 5 | Exc: 1495, Typ: 864, Goo: 476, Fai: 92 |
| central_air | 0 | 1 | FALSE | 2 | Y: 2734, N: 196 |
| electrical | 0 | 1 | FALSE | 6 | SBr: 2682, Fus: 188, Fus: 50, Fus: 8 |
| functional | 0 | 1 | FALSE | 8 | Typ: 2728, Min: 70, Min: 65, Mod: 35 |
| garage_type | 0 | 1 | FALSE | 7 | Att: 1731, Det: 782, Bui: 186, No_: 157 |
| garage_finish | 0 | 1 | FALSE | 4 | Unf: 1231, RFn: 812, Fin: 728, No_: 159 |
| garage_cond | 0 | 1 | FALSE | 6 | Typ: 2665, No_: 159, Fai: 74, Goo: 15 |
| paved_drive | 0 | 1 | FALSE | 3 | Pav: 2652, Dir: 216, Par: 62 |
| pool_qc | 0 | 1 | FALSE | 5 | No_: 2917, Exc: 4, Goo: 4, Typ: 3 |
| fence | 0 | 1 | FALSE | 5 | No_: 2358, Min: 330, Goo: 118, Goo: 112 |
| misc_feature | 0 | 1 | FALSE | 6 | Non: 2824, She: 95, Gar: 5, Oth: 4 |
| sale_type | 0 | 1 | FALSE | 10 | WD : 2536, New: 239, COD: 87, Con: 26 |
| sale_condition | 0 | 1 | FALSE | 6 | Nor: 2413, Par: 245, Abn: 190, Fam: 46 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| lot_frontage | 0 | 1 | 57.65 | 33.50 | 0.00 | 43.00 | 63.00 | 78.00 | 313.00 | ▇▇▁▁▁ |
| lot_area | 0 | 1 | 10147.92 | 7880.02 | 1300.00 | 7440.25 | 9436.50 | 11555.25 | 215245.00 | ▇▁▁▁▁ |
| year_built | 0 | 1 | 1971.36 | 30.25 | 1872.00 | 1954.00 | 1973.00 | 2001.00 | 2010.00 | ▁▂▃▆▇ |
| year_remod_add | 0 | 1 | 1984.27 | 20.86 | 1950.00 | 1965.00 | 1993.00 | 2004.00 | 2010.00 | ▅▂▂▃▇ |
| mas_vnr_area | 0 | 1 | 101.10 | 178.63 | 0.00 | 0.00 | 0.00 | 162.75 | 1600.00 | ▇▁▁▁▁ |
| bsmt_fin_sf_1 | 0 | 1 | 4.18 | 2.23 | 0.00 | 3.00 | 3.00 | 7.00 | 7.00 | ▃▂▇▁▇ |
| bsmt_fin_sf_2 | 0 | 1 | 49.71 | 169.14 | 0.00 | 0.00 | 0.00 | 0.00 | 1526.00 | ▇▁▁▁▁ |
| bsmt_unf_sf | 0 | 1 | 559.07 | 439.54 | 0.00 | 219.00 | 465.50 | 801.75 | 2336.00 | ▇▅▂▁▁ |
| total_bsmt_sf | 0 | 1 | 1051.26 | 440.97 | 0.00 | 793.00 | 990.00 | 1301.50 | 6110.00 | ▇▃▁▁▁ |
| first_flr_sf | 0 | 1 | 1159.56 | 391.89 | 334.00 | 876.25 | 1084.00 | 1384.00 | 5095.00 | ▇▃▁▁▁ |
| second_flr_sf | 0 | 1 | 335.46 | 428.40 | 0.00 | 0.00 | 0.00 | 703.75 | 2065.00 | ▇▃▂▁▁ |
| gr_liv_area | 0 | 1 | 1499.69 | 505.51 | 334.00 | 1126.00 | 1442.00 | 1742.75 | 5642.00 | ▇▇▁▁▁ |
| bsmt_full_bath | 0 | 1 | 0.43 | 0.52 | 0.00 | 0.00 | 0.00 | 1.00 | 3.00 | ▇▆▁▁▁ |
| bsmt_half_bath | 0 | 1 | 0.06 | 0.25 | 0.00 | 0.00 | 0.00 | 0.00 | 2.00 | ▇▁▁▁▁ |
| full_bath | 0 | 1 | 1.57 | 0.55 | 0.00 | 1.00 | 2.00 | 2.00 | 4.00 | ▁▇▇▁▁ |
| half_bath | 0 | 1 | 0.38 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 2.00 | ▇▁▅▁▁ |
| bedroom_abv_gr | 0 | 1 | 2.85 | 0.83 | 0.00 | 2.00 | 3.00 | 3.00 | 8.00 | ▁▇▂▁▁ |
| kitchen_abv_gr | 0 | 1 | 1.04 | 0.21 | 0.00 | 1.00 | 1.00 | 1.00 | 3.00 | ▁▇▁▁▁ |
| tot_rms_abv_grd | 0 | 1 | 6.44 | 1.57 | 2.00 | 5.00 | 6.00 | 7.00 | 15.00 | ▁▇▂▁▁ |
| fireplaces | 0 | 1 | 0.60 | 0.65 | 0.00 | 0.00 | 1.00 | 1.00 | 4.00 | ▇▇▁▁▁ |
| garage_cars | 0 | 1 | 1.77 | 0.76 | 0.00 | 1.00 | 2.00 | 2.00 | 5.00 | ▅▇▂▁▁ |
| garage_area | 0 | 1 | 472.66 | 215.19 | 0.00 | 320.00 | 480.00 | 576.00 | 1488.00 | ▃▇▃▁▁ |
| wood_deck_sf | 0 | 1 | 93.75 | 126.36 | 0.00 | 0.00 | 0.00 | 168.00 | 1424.00 | ▇▁▁▁▁ |
| open_porch_sf | 0 | 1 | 47.53 | 67.48 | 0.00 | 0.00 | 27.00 | 70.00 | 742.00 | ▇▁▁▁▁ |
| enclosed_porch | 0 | 1 | 23.01 | 64.14 | 0.00 | 0.00 | 0.00 | 0.00 | 1012.00 | ▇▁▁▁▁ |
| three_season_porch | 0 | 1 | 2.59 | 25.14 | 0.00 | 0.00 | 0.00 | 0.00 | 508.00 | ▇▁▁▁▁ |
| screen_porch | 0 | 1 | 16.00 | 56.09 | 0.00 | 0.00 | 0.00 | 0.00 | 576.00 | ▇▁▁▁▁ |
| pool_area | 0 | 1 | 2.24 | 35.60 | 0.00 | 0.00 | 0.00 | 0.00 | 800.00 | ▇▁▁▁▁ |
| misc_val | 0 | 1 | 50.64 | 566.34 | 0.00 | 0.00 | 0.00 | 0.00 | 17000.00 | ▇▁▁▁▁ |
| mo_sold | 0 | 1 | 6.22 | 2.71 | 1.00 | 4.00 | 6.00 | 8.00 | 12.00 | ▅▆▇▃▃ |
| year_sold | 0 | 1 | 2007.79 | 1.32 | 2006.00 | 2007.00 | 2008.00 | 2009.00 | 2010.00 | ▇▇▇▇▃ |
| sale_price | 0 | 1 | 180796.06 | 79886.69 | 12789.00 | 129500.00 | 160000.00 | 213500.00 | 755000.00 | ▇▇▁▁▁ |
| longitude | 0 | 1 | -93.64 | 0.03 | -93.69 | -93.66 | -93.64 | -93.62 | -93.58 | ▅▅▇▆▁ |
| latitude | 0 | 1 | 42.03 | 0.02 | 41.99 | 42.02 | 42.03 | 42.05 | 42.06 | ▂▂▇▇▇ |
Focusing our attention on the numeric variables. We can also transform the skim() results into a tibble.
Code
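A plausible reconstruction of the hidden chunk: skimming only the numeric columns and coercing the result with as_tibble() to drop the skimr class gives the 34 × 12 tibble shown below.

```r
library(modeldata)
library(janitor)
library(skimr)
library(dplyr)

ames |>
  clean_names() |>
  select(where(is.numeric)) |>
  skim() |>
  as_tibble()
```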
# A tibble: 34 × 12
skim_type skim_variable n_missing complete_rate numeric.mean numeric.sd
<chr> <chr> <int> <dbl> <dbl> <dbl>
1 numeric lot_frontage 0 1 57.6 33.5
2 numeric lot_area 0 1 10148. 7880.
3 numeric year_built 0 1 1971. 30.2
4 numeric year_remod_add 0 1 1984. 20.9
5 numeric mas_vnr_area 0 1 101. 179.
6 numeric bsmt_fin_sf_1 0 1 4.18 2.23
7 numeric bsmt_fin_sf_2 0 1 49.7 169.
8 numeric bsmt_unf_sf 0 1 559. 440.
9 numeric total_bsmt_sf 0 1 1051. 441.
10 numeric first_flr_sf 0 1 1160. 392.
# ℹ 24 more rows
# ℹ 6 more variables: numeric.p0 <dbl>, numeric.p25 <dbl>, numeric.p50 <dbl>,
# numeric.p75 <dbl>, numeric.p100 <dbl>, numeric.hist <chr>
Conducting summary statistics by focusing on variables that are continuous. To do so we need to define a custom function that checks for variables that are counts. Notice the function checks whether every value of the variable lies between 0 and 10: the expression x %in% 0:10 gives a vector of TRUE or FALSE values, and all() checks that every one of them is TRUE.
Next we can use this information to identify continuous variables. Applied to ames_numeric, the anonymous function identified 9 count variables. If you want to identify continuous variables instead, negate the condition: !all(x %in% 0:10). This revised anonymous function yielded 25 variables that are continuous.
Code
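A sketch of the code that likely produced the output below; the names ames_numeric and ames_non_counts match those used later in the text:

```r
library(modeldata)
library(janitor)
library(dplyr)

# Numeric columns of the cleaned data
ames_numeric <- ames |>
  clean_names() |>
  select(where(is.numeric))

# Count variables: every value lies between 0 and 10
ames_numeric |>
  select(where(function(x) all(x %in% 0:10)))

# Continuous variables: negate the condition
ames_non_counts <- ames_numeric |>
  select(where(function(x) !all(x %in% 0:10)))
```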
# A tibble: 2,930 × 9
bsmt_fin_sf_1 bsmt_full_bath bsmt_half_bath full_bath half_bath
<dbl> <dbl> <dbl> <int> <int>
1 2 1 0 1 0
2 6 0 0 1 0
3 1 0 0 1 1
4 1 1 0 2 1
5 3 0 0 2 1
6 3 0 0 2 1
7 3 1 0 2 0
8 1 0 0 2 0
9 3 1 0 2 0
10 7 0 0 2 1
# ℹ 2,920 more rows
# ℹ 4 more variables: bedroom_abv_gr <int>, kitchen_abv_gr <int>,
# fireplaces <int>, garage_cars <dbl>
Using the data subset of continuous variables, let's focus on columns that contain the word area. Next we identify variables containing the string sf.
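The chunks behind the two outputs below are hidden; presumably they are two contains() selections on the continuous subset:

```r
#Columns that contain the word area
ames_non_counts |>
  select(contains('area'))

#Columns that contain the string sf
ames_non_counts |>
  select(contains('sf'))
```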
# A tibble: 2,930 × 5
lot_area mas_vnr_area gr_liv_area garage_area pool_area
<int> <dbl> <int> <dbl> <int>
1 31770 112 1656 528 0
2 11622 0 896 730 0
3 14267 108 1329 312 0
4 11160 0 2110 522 0
5 13830 0 1629 482 0
6 9978 20 1604 470 0
7 4920 0 1338 582 0
8 5005 0 1280 506 0
9 5389 0 1616 608 0
10 7500 0 1804 442 0
# ℹ 2,920 more rows
# A tibble: 2,930 × 7
bsmt_fin_sf_2 bsmt_unf_sf total_bsmt_sf first_flr_sf second_flr_sf
<dbl> <dbl> <dbl> <int> <int>
1 0 441 1080 1656 0
2 144 270 882 896 0
3 0 406 1329 1329 0
4 0 1045 2110 2110 0
5 0 137 928 928 701
6 0 324 926 926 678
7 0 722 1338 1338 0
8 0 1017 1280 1280 0
9 0 415 1595 1616 0
10 0 994 994 1028 776
# ℹ 2,920 more rows
# ℹ 2 more variables: wood_deck_sf <int>, open_porch_sf <int>
Code
#Examining if some variables are the sum of two others
ames_non_counts |>
filter(gr_liv_area != first_flr_sf + second_flr_sf
)
# A tibble: 40 × 25
lot_frontage lot_area year_built year_remod_add mas_vnr_area bsmt_fin_sf_2
<dbl> <int> <int> <int> <dbl> <dbl>
1 50 5500 1929 2001 0 692
2 50 9350 1947 1979 0 0
3 60 10410 1916 1987 0 0
4 60 10410 1915 1950 0 0
5 60 12144 1949 1950 0 0
6 60 8064 1948 1994 0 0
7 98 8820 1890 1996 0 0
8 60 5400 1920 1950 0 0
9 92 5520 1912 1950 0 0
10 52 6240 1925 1950 0 0
# ℹ 30 more rows
# ℹ 19 more variables: bsmt_unf_sf <dbl>, total_bsmt_sf <dbl>,
# first_flr_sf <int>, second_flr_sf <int>, gr_liv_area <int>,
# tot_rms_abv_grd <int>, garage_area <dbl>, wood_deck_sf <int>,
# open_porch_sf <int>, enclosed_porch <int>, three_season_porch <int>,
# screen_porch <int>, pool_area <int>, misc_val <int>, mo_sold <int>,
# year_sold <int>, sale_price <int>, longitude <dbl>, latitude <dbl>
Focusing on sale price and calculating summary statistics. Notice, Albert introduces the skewness() function from the e1071 package to request skewness. To display the results in a nice format, we can pivot_longer() them.
In pivot_longer(), we select all of the columns using cols = everything(). Notice the median is lower than the mean, signifying the distribution is positively skewed.
Code
library(tidyverse)
library(gt)
#Calculating averages using across
ames |>
summarise(
across(
.cols = sale_price,
.fns = list(
min = min,
avg = mean,
median = median,
max = max,
standard_dev = sd,
skewness = e1071::skewness),
.names = '{.fn}'
)
) |>
#Pivoting longer to create a summary stats table
pivot_longer(
cols = everything()
)
# A tibble: 6 × 2
name value
<chr> <dbl>
1 min 12789
2 avg 180796.
3 median 160000
4 max 755000
5 standard_dev 79887.
6 skewness 1.74
The neighborhood column has 28 values, which makes it difficult to handle. Consequently, we could lump some of the values together. The function fct_lump_n() lumps all levels except for the n most frequent (or least frequent if n < 0). In this case, we keep the top 5 most frequent categories and lump the rest together, producing 6 neighborhoods, where Other is a collection of the remaining ones.
Code
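The chunk for the next output is hidden; the sorted frequency table below was presumably produced with count():

```r
ames |>
  count(neighborhood, sort = TRUE) |>
  print(n = Inf)
```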
# A tibble: 28 × 2
neighborhood n
<fct> <int>
1 North_Ames 443
2 College_Creek 267
3 Old_Town 239
4 Edwards 194
5 Somerset 182
6 Northridge_Heights 166
7 Gilbert 165
8 Sawyer 151
9 Northwest_Ames 131
10 Sawyer_West 125
11 Mitchell 114
12 Brookside 108
13 Crawford 103
14 Iowa_DOT_and_Rail_Road 93
15 Timberland 72
16 Northridge 71
17 Stone_Brook 51
18 South_and_West_of_Iowa_State_University 48
19 Clear_Creek 44
20 Meadow_Village 37
21 Briardale 30
22 Bloomington_Heights 28
23 Veenker 24
24 Northpark_Villa 23
25 Blueste 10
26 Greens 8
27 Green_Hills 2
28 Landmark 1
Code
ames |>
mutate(neighborhood = fct_lump_n(
neighborhood,
n = 5)
) |>
skim(neighborhood)
| Name | mutate(ames, neighborhood… |
| Number of rows | 2930 |
| Number of columns | 74 |
| _______________________ | |
| Column type frequency: | |
| factor | 1 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| neighborhood | 0 | 1 | FALSE | 6 | Oth: 1605, Nor: 443, Col: 267, Old: 239 |
Code
ames_top_five <- ames |>
mutate(neighborhood = fct_lump_n(
neighborhood,
n = 5)
)
ames_top_five |>
select(neighborhood) |>
count(neighborhood) |>
print(n = Inf)
# A tibble: 6 × 2
neighborhood n
<fct> <int>
1 North_Ames 443
2 College_Creek 267
3 Old_Town 239
4 Edwards 194
5 Somerset 182
6 Other 1605
With this new dataset, we can build visualizations comparing house prices across the top five neighborhoods plus Other. We can use facet_wrap() on neighborhood, and scale_x_log10() to put sale price on a logarithmic x-axis.
Code
ames_top_five |>
ggplot(aes(x = sale_price)) +
geom_density(fill = 'grey80') +
facet_wrap(~ neighborhood, nrow = 6) +
scale_x_log10(labels = scales::label_dollar()) +
theme_minimal(
base_family = 'Source Sans Pro',
base_size = 16
)
We can make this figure more informative by adding labels for the minimum and maximum sale price. To do so, we can add a geom_text() layer. The values for the layer can be extracted with the reframe() function.
The reframe() function allows us to calculate the sale price range per group. It returns a tibble containing the range of prices. Next we add a type column to label the minimum and maximum, computing the values by neighborhood.
We can save the values in a dataframe, say ranges.
Code
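A plausible reconstruction of the hidden chunk: range() returns c(min, max) for each group, so each neighborhood contributes two rows, labelled by the type column:

```r
ranges <- ames_top_five |>
  reframe(
    type = c('min', 'max'),
    sale_price = range(sale_price),
    .by = neighborhood
  )
ranges
```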
# A tibble: 12 × 3
neighborhood type sale_price
<fct> <chr> <int>
1 North_Ames min 68000
2 North_Ames max 345000
3 Other min 13100
4 Other max 755000
5 Somerset min 139000
6 Somerset max 468000
7 Old_Town min 12789
8 Old_Town max 475000
9 Edwards min 35000
10 Edwards max 415000
11 College_Creek min 110000
12 College_Creek max 475000
Modifying the ggplot to add a geom_text() layer with the labels for each neighborhood. In geom_text(), we specify the data and the aes(): y is set to -1 to place the labels below the density curves, x is inherited from sale_price in the ranges data, and label refers to the variable containing the sale price.
Code
ames_top_five |>
ggplot(aes(x = sale_price)) +
geom_density(fill = 'grey80') +
geom_text(data = ranges,
aes(y = -1, label = sale_price)) +
facet_wrap(~ neighborhood, nrow = 6) +
scale_x_log10(labels = scales::label_dollar()) +
theme_minimal(
base_family = 'Source Sans Pro',
base_size = 16
)
Our first attempt produced raw, hard-to-read numbers as labels. We can make them nicer by passing the values to the scales::dollar() function.
Code
ames_top_five |>
ggplot(aes(x = sale_price)) +
geom_density(fill = 'grey80') +
geom_text(data = ranges,
aes(y = -1, label = sale_price |>
scales::dollar()) ) +
facet_wrap(~ neighborhood, nrow = 6) +
scale_x_log10(labels = scales::label_dollar()) +
theme_minimal(
base_family = 'Source Sans Pro',
base_size = 16
)
To make it more obvious what those labels are, we can add another geom_point() layer. It adds points at the min and max. We can also use a diamond shape (shape = 5) instead of the default point.
Code
ames_top_five |>
ggplot(aes(x = sale_price)) +
geom_density(fill = 'grey80') +
geom_point(data = ranges,
aes(y = -0.3),
shape = 5,
size = 2) +
geom_text(data = ranges,
aes(y = -1.2, label = sale_price |>
scales::dollar()), size = 2.5 ) +
facet_wrap(~ neighborhood, nrow = 6) +
scale_x_log10(labels = scales::label_dollar()) +
theme_minimal(
base_family = 'Source Sans Pro',
base_size = 16
)
Another data visualization, but this time dealing with count variables. In this case, we identified 9 count variables. Let's check the column names.
Code
# ames_numeric |>
# select(where(function(x) all(x %in% 0:10) ) ) |>
# colnames()
We can lower the range used to flag counts. We end up with 7 variables instead of 9, after also excluding variables related to basement and kitchen.
Code
# selected_counts <- ames_numeric |>
# select(where(function(x) all(x %in% 0:5) ) ) |>
# select(
# -contains('bsmt'),
# -contains('kitchen')
# )
#
# selected_counts |>
# skim()
We can mutate again with across(), transforming the count variables into factor variables. Notice that for most variables the frequencies concentrate at 0 or 1. Consequently, we could lump the larger values together.
Code
# selected_counts |>
# mutate(
# across(
# .cols = everything(),
# .fns = factor )
# ) |>
# skim()
Lumping factor variables
Lumping together levels of factor variables with limited range. Instead of using fct_lump_n(), we can use if_else() inside across(), with an anonymous or custom function that leaves numbers smaller than 2 as they are but replaces numbers of 2 or more with the text 2+.
Code
# selected_counts |>
# mutate(
# across(
# .cols = everything(),
# .fns = function(x) if_else(x < 2, x, '2+')
# ) )
However, our function generated an error since we mixed numeric values with characters. Consequently, we need to transform the numeric values into character format. The skim() function then reveals we have three unique values.
Code
# selected_counts |>
# mutate(
# across(
# .cols = everything(),
# .fns = function(x) if_else(x < 2,
# as.character(x), '2+')
# ) ) |>
# skim()
And then we can transform back into a factor by adding the factor() function at the end.
Code
# selected_counts |>
# mutate(
# across(
# .cols = everything(),
# .fns = function(x) if_else(x < 2,
# as.character(x),
# '2+'
# ) |>
# factor(levels = c(0:1,'2+')) ) ) |>
# skim()
We can modify the dataset to include the house sale price.
Then we have to make sure our across() call includes all variables except the sale_price column. Afterwards we have all count variables plus sale_price, which is a continuous variable.
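The chunk defining grouped_counts is not shown; a sketch of a plausible reconstruction, reusing selected_counts from the earlier chunk and janitor::clean_names() (both assumptions):

```r
# Sketch: attach sale_price to the lumped counts
grouped_counts <- selected_counts |>
  mutate(
    across(
      .cols = everything(),
      .fns = function(x) if_else(x < 2, as.character(x), '2+') |>
        factor(levels = c(0:1, '2+'))
    ),
    # clean_names() keeps column naming consistent with the skimmed data
    sale_price = clean_names(ames)$sale_price
  )
```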
The dataset needs to be transformed into a long format for ggplot(). We pivot everything except the sale price.
Code
# grouped_counts |>
# pivot_longer(
# cols = -sale_price
# ) |>
# ggplot(aes(x = sale_price)) +
# geom_density(fill = 'grey80')
#
# grouped_counts |>
# pivot_longer(
# cols = -sale_price
# ) |>
# ggplot(aes(x = sale_price)) +
# geom_density(fill = 'grey80') +
# scale_x_log10(label = scales::label_dollar()) +
# facet_grid(
# rows = vars(name),
# cols = vars(value)
# ) +
# theme_minimal(
# base_family = 'Source Sans Pro',
# base_size = 16
# )
We can center all distributions on their mean price using the scale() function, scaling separately by the type of amenity with .by = name. Notice we get lots of warning messages because scaling produces negative values, which become non-finite under the log-10 transformation.
Code
# grouped_counts |>
# pivot_longer(
# cols = -sale_price
# ) |>
# mutate(
# sale_price = scale(sale_price),
# .by = name
# ) |>
# ggplot(aes(x = sale_price)) +
# geom_density(fill = 'grey80') +
# scale_x_log10(label = scales::label_dollar()) +
# facet_grid(
# rows = vars(name),
# cols = vars(value)
# ) +
# theme_minimal(
# base_family = 'Source Sans Pro',
# base_size = 16
# )
Changing the log-10 transformation to a plain continuous scale.
Code
# grouped_counts |>
# pivot_longer(
# cols = -sale_price
# ) |>
# mutate(
# sale_price = scale(sale_price),
# .by = name
# ) |>
# ggplot(aes(x = sale_price)) +
# geom_density(fill = 'grey80') +
# scale_x_continuous() +
# facet_grid(
# rows = vars(name),
# cols = vars(value)
# ) +
# theme_minimal(
# base_family = 'Source Sans Pro',
# base_size = 16
# )
However, the scaling works better if you log-transform the variable before scaling.
Code
# grouped_counts |>
# pivot_longer(
# cols = -sale_price
# ) |>
# mutate(
# sale_price = scale(sale_price |>
# log()),
# .by = name
# ) |>
# ggplot(aes(x = sale_price)) +
# geom_density(fill = 'grey80') +
# scale_x_continuous() +
# facet_grid(
# rows = vars(name),
# cols = vars(value)
# ) +
# theme_minimal(
# base_family = 'Source Sans Pro',
# base_size = 16
# )